Transforming Features





Kerry Back

Outliers and scaling

  • For neural nets and other methods (not forests), it is important to have predictors that are
    • on the same scale
    • free of outliers

Neural net example

For a neuron with

\[ y = \max(0, b + w_1x_1 + \cdots + w_n x_n)\]

  • to find the right \(w\)’s, it helps to have \(x\)’s of similar scales
  • multiplying an outlier by a weight \(w\) can produce an outlier \(y\)
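
To make the second point concrete, here is a minimal sketch of a single neuron evaluated on a typical input and on an input containing an outlier. The weights, bias, and input values are hypothetical and chosen only for illustration.

```python
def relu_neuron(x, w, b):
    """Return max(0, b + w_1*x_1 + ... + w_n*x_n) for one neuron."""
    return max(0.0, b + sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical weights and bias (assumptions, not fitted values).
w = [0.5, -0.25]
b = 0.1

typical = [1.0, 2.0]      # both inputs of ordinary size
outlier = [1000.0, 2.0]   # first input is an extreme value

y_typical = relu_neuron(typical, w, b)   # about 0.1
y_outlier = relu_neuron(outlier, w, b)   # about 499.6
print(y_typical, y_outlier)
```

The outlier in the input passes straight through to the output, so later layers would also receive an extreme value.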

Quantile transformer

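There are many ways to take care of outliers and scaling, but we’ll use just one: the quantile transformer, which maps each predictor to an approximately standard normal distribution based on its empirical quantiles.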

from sklearn.preprocessing import QuantileTransformer

transform = QuantileTransformer(
    output_distribution="normal"
)
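
The idea behind the transformation can be sketched with the standard library alone: replace each value by the standard normal quantile of its rank. This is an illustration of the principle under simplifying assumptions (no ties, no new data), not sklearn's implementation, which estimates quantiles on a grid and interpolates. The function name `rank_to_normal` and the data are hypothetical.

```python
from statistics import NormalDist

def rank_to_normal(values):
    """Map each value to the standard-normal quantile of its rank.

    Illustrative only: ties are not averaged, and values outside
    the sample are not handled.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n              # empirical quantile strictly in (0, 1)
        out[i] = NormalDist().inv_cdf(p)  # corresponding standard normal quantile
    return out

# Hypothetical data with one extreme outlier.
x = [0.02, -0.01, 0.05, 0.03, 250.0]
z = rank_to_normal(x)
print([round(v, 2) for v in z])
```

The outlier keeps its place as the largest value, but after the mapping it is only a little more than one standard deviation from the center.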

Example: roeq in 2021-01

[Figure: distribution of roeq before (old) and after (new) the quantile transformation]

Pipelines

  • We put the transformation and the model in a pipeline.
  • Then we fit the pipeline and predict with the pipeline.
  • The fitted pipeline stores the transformation learned from the training data, so it can apply the same transformation to new observations.
  • Neural net example for rnk in 2021-01 (model, X, and y are defined in the workflow below):
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(transform, model)
pipe.fit(X, y)
pipe.score(X, y)  # R-squared on the data used for fitting
0.06097467915277932
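
As a sketch of how a fitted pipeline handles new observations, the example below fits on synthetic data and then predicts for rows the pipeline has not seen. The data, the array names, and the choice of `n_quantiles=100` are assumptions made for illustration; they are not the course data or settings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors and a target (not the course data).
X_train = rng.standard_t(df=2, size=(500, 2))  # heavy tails, so outliers occur
y_train = rng.uniform(size=500)

pipe = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution="normal"),
    MLPRegressor(hidden_layer_sizes=(4, 2), random_state=0),
)
pipe.fit(X_train, y_train)

# New observations are passed in raw; the pipeline applies the quantile
# mapping learned from X_train before calling the neural net.
X_new = rng.standard_t(df=2, size=(5, 2))
predictions = pipe.predict(X_new)

# The fitted transformer is stored as a named step of the pipeline.
fitted_transform = pipe.named_steps["quantiletransformer"]
print(predictions.shape, fitted_transform.quantiles_.shape)
```

Because the transformer is fitted only inside `pipe.fit`, the new rows are transformed using the training quantiles rather than their own.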

Entire workflow: connect to database

from sqlalchemy import create_engine
import pymssql  # driver used by the "mssql+pymssql" connection string
import pandas as pd

server = "mssql-82792-0.cloudclusters.net:16272"
username = "user"
password = ""  # paste password between quote marks
database = "ghz"

# SQLAlchemy URL format: dialect+driver://username:password@host:port/database
string = f"mssql+pymssql://{username}:{password}@{server}/{database}"

conn = create_engine(string).connect()

Download data

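# Pull returns and two predictors for a single month
data = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date='2021-01'
    """,
    conn
)
# Drop rows with any missing value
data = data.dropna()
# Target: percentile rank of the return within the month, in (0, 1]
data["rnk"] = data.ret.rank(pct=True)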
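
As a small illustration of the target variable, the sketch below applies the same percentile ranking to four hypothetical returns. The tickers and values are made up.

```python
import pandas as pd

# Hypothetical monthly returns for four stocks.
ret = pd.Series([0.05, -0.02, 0.10, 0.01], index=["A", "B", "C", "D"])

# rank(pct=True) divides each rank (1 = lowest) by the number of observations.
rnk = ret.rank(pct=True)
print(rnk.tolist())  # [0.75, 0.25, 1.0, 0.5]
```

The best-performing stock gets 1.0 and the worst gets 1/n, so the target is bounded and has no outliers regardless of how extreme the raw returns are.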

Define pipeline

from sklearn.preprocessing import QuantileTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

transform = QuantileTransformer(
    output_distribution="normal"
)
model = MLPRegressor(
    hidden_layer_sizes=(4, 2),
    random_state=0
)
pipe = make_pipeline(transform, model)

Fit the pipeline

X = data[["roeq", "mom12m"]]
y = data["rnk"]

pipe.fit(X, y)


The workflow is the same for a random forest, except that we can fit the model directly and skip the pipeline.
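
A minimal sketch of the forest version follows, using synthetic data in place of the database query. Trees split on thresholds, so they depend on the ordering of each predictor's values rather than on their scale, which is why no transformation step is included. The data and the `max_depth=3` setting are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the predictors and the rank target (not the course data).
X = rng.standard_t(df=2, size=(500, 2))  # heavy-tailed predictors, left untransformed
y = rng.uniform(size=500)

# Fit the model directly: no transformer and no pipeline.
forest = RandomForestRegressor(max_depth=3, random_state=0)
forest.fit(X, y)

score = forest.score(X, y)          # R-squared on the data used for fitting
new_preds = forest.predict(X[:5])   # predictions from raw predictor values
print(score, new_preds.shape)
```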